If you’ve ever searched online for a recipe or asked “what should I make for dinner” in a search engine, one of the results is most likely from allrecipes.com.
Allrecipes.com is a recipe-sharing platform with over 100,000 recipes and 60 million users globally. Users can submit their own recipes and interact with other users by commenting on, reviewing, or rating recipes. Allrecipes is unusual in that it is a public forum for recipes and a community for anyone who wants to cook, rather than a carefully curated blog. You can find family heirlooms like Jewish Grandma’s Best Beef Brisket or simple recipes like these basic crepes.
In addition to being a great resource when you’re in a cooking rut, the site holds a trove of data on each page. So much so that Brian Mubia took notice and decided to create the tastyR package.
The tastyR package contains two datasets, allrecipes and cuisines. For this project, we’ll dive into cuisines - a dataset containing over 2,000 recipes from allrecipes with information on ingredients, cuisine, nutrition, reviews, and ratings.
After reviewing the cuisines dataset, I was most interested in two components: the cuisine variable, which closely parallels country of origin, and the ingredients variable. We often group countries’ cuisines into broader categories by geographic location. For example, Italian, Greek, and Turkish food are commonly considered Mediterranean. I was curious to see whether that intuition holds up based on ingredient usage, or whether geographically distant cuisines have more in common than we might have assumed.
To help guide the project, I came up with some questions to try and answer.
To answer these questions, we will:
Before diving into the data analysis, I want to go over the contents of the dataset, the data cleaning steps, and lastly, how ingredients were tokenized.
| variable | type | label |
|---|---|---|
| country | character | Cuisine |
| name | character | Name of Recipe |
| url | character | URL |
| author | character | Author |
| date_published | Date | Date Published or Last Updated |
| ingredients | character | List of Ingredients |
| calories | integer | Calories per Serving |
| fat | integer | Fat per Serving |
| carbs | integer | Carbs per Serving |
| protein | integer | Protein per Serving |
| avg_rating | numeric | Average Ratings |
| total_ratings | integer | Total Number of Ratings |
| reviews | integer | Total Number of Reviews |
| prep_time | integer | Prep Time (in minutes) |
| cook_time | integer | Cook Time (in minutes) |
| total_time | integer | Total Time (in minutes) |
| servings | integer | Number of Servings |
The cuisines dataset contains 2,218 records with 17 different variables, described above. Records are uniquely identified by name and author.
Ingredients are comma-delimited and include measurements and units, but they are not standardized. Fat, carbs, and protein are measured in grams. Ratings are on a 1-star to 5-star scale.
Something to note is that total ratings and reviews are erroneously truncated to the thousands, unless there were fewer than 1,000 ratings in total. This was discovered when creating frequency tables and confirmed online (GitHub - TidyTuesday Data).
In the exploratory data analysis, we will cover some basic descriptive statistics on some of these variables to give us a general idea of the recipes we are working with.
For data cleaning, the following steps were taken:
stringdistmatrix, hclust, and cutree. The threshold for cutree was 0.05, and duplicates were manually reviewed to ensure that they were similar enough to be considered the same and that the threshold was appropriate.

Outliers in numeric variables were not examined, as these variables will not be used in the main analysis. After data cleaning, 9 records were removed from the initial dataset.
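The duplicate check can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the recipe names are toy examples, base R's `adist()` (Levenshtein distance) stands in for `stringdistmatrix` so the sketch needs no extra packages, and both the length normalization and the average linkage are my own choices, since the text does not say which distance method or linkage was used.

```r
# Toy recipe names for illustration only; the second is a near-duplicate of the first
recipe_names <- c("Jewish Grandma's Best Beef Brisket",
                  "Jewish Grandmas Best Beef Brisket",
                  "Basic Crepes",
                  "Chicken Katsu")

# Pairwise edit distances, scaled to [0, 1] by the length of the longer string.
# adist() is a stand-in for stringdist::stringdistmatrix() (assumption).
raw_dist  <- adist(tolower(recipe_names))
max_len   <- outer(nchar(recipe_names), nchar(recipe_names), pmax)
norm_dist <- raw_dist / max_len

# Hierarchical clustering on the distances, then cut the tree at height 0.05
hc         <- hclust(as.dist(norm_dist), method = "average")
cluster_id <- cutree(hc, h = 0.05)

# Names that share a cluster id are candidate duplicates for manual review
candidates <- split(recipe_names, cluster_id)
candidates <- candidates[lengths(candidates) > 1]
print(candidates)
```

Cutting by height rather than by a fixed number of clusters means only names within the chosen distance of each other are grouped, which is why the manual review of each group still matters.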
Below is an example of what the raw ingredients variable looks like.
## ingredients
## 1 1 pound sliced bacon, diced, 1 medium sweet onion, chopped, 9 large eggs, lightly beaten, 4 cups frozen shredded hash brown potatoes, thawed, 2 cups shredded Cheddar cheese, 1 ½ cups small curd cottage cheese, 1 ¼ cups shredded Swiss cheese
## 2 3 egg yolks, 1 tablespoon lemon juice, ¼ teaspoon Dijon mustard, 1 dash hot pepper sauce (e.g. Tabasco™), ½ cup butter
## 3 oil for deep frying, 1 cup unbleached all-purpose flour, 2 teaspoons salt, ½ teaspoon ground black pepper, ½ teaspoon cayenne pepper, ½ teaspoon paprika, ¼ teaspoon garlic powder, 1 large egg, 1 cup milk, 3 skinless, boneless chicken breasts, cut into 1/2-inch strips, ¼ cup hot pepper sauce, 1 tablespoon butter
## 4 1 orange, 1 lemon, 1 lime, 1 (750 milliliter) bottle dry red wine, 1 ½ cups rum, 1 cup orange juice, ½ cup white sugar, or to taste
## 5 4 skinless, boneless chicken breast halves - pounded to ½-inch thickness, salt and pepper to taste, 2 tablespoons all-purpose flour, 1 egg, beaten, 1 cup panko bread crumbs, 1 cup oil for frying, or as needed
As you can see, it contains a lot of information, represented in different forms. For example, there is additional text within parentheses, as well as measurements and methods of preparation (e.g., “chopped”).
For the purposes of PCA and t-SNE, we will standardize and tokenize the ingredients so that we get one row per ingredient per recipe, with no measurement, unit, or additional information. Unnecessary adjectives such as small, large, etc. will be removed.
The following steps were taken:
stringdistmatrix, hclust, and cutree like above. The threshold for cutree was 0.10, and clusters were manually reviewed to ensure that they were true misspellings.

Standardizing the ingredients was not a trivial effort. Given the number of recipes and ingredients, it is not guaranteed that every case was accounted for in this step.
After these steps, this is what the ingredients column became:
## food
## 1 bacon
## 2 onion
## 3 egg
## 4 potato
## 5 cheddar cheese
## 6 cottage cheese
## 7 swiss cheese
## 8 yolk
## 9 lemon juice
## 10 dijon mustard
## 11 pepper sauce
## 12 butter
## 13 flour
## 14 salt
## 15 pepper
## 16 cayenne pepper
## 17 paprika
## 18 powder
## 19 egg
## 20 milk
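To give a feel for the kind of tokenizing described above, here is a small base R sketch. It is an illustration only: the word list `drop_words` and the lookup `singular` are hypothetical and far shorter than what the real data would need. The sample string uses ASCII quantities, so the vulgar fractions (such as ½) that appear in the real data would have to be added to the patterns.

```r
# One raw ingredients string, shortened from the style shown earlier
raw <- paste("1 pound sliced bacon, diced, 1 medium sweet onion, chopped,",
             "9 large eggs, lightly beaten, 2 cups shredded Cheddar cheese")

# Split only at commas followed by a quantity, so preparation notes such as
# ", diced" stay attached to their ingredient instead of becoming tokens
items <- strsplit(raw, ",\\s*(?=[0-9])", perl = TRUE)[[1]]

items <- sub(",.*$", "", items)           # drop the preparation note after a comma
items <- gsub("\\([^)]*\\)", "", items)   # drop any parenthetical text
items <- sub("^[0-9/ .-]+", "", items)    # drop the leading quantity

# Hypothetical list of units and descriptors to remove
drop_words <- c("pounds", "pound", "cups", "cup", "small", "medium", "large",
                "sliced", "shredded", "sweet")
pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
items <- gsub(pattern, "", items, ignore.case = TRUE, perl = TRUE)

# Tidy whitespace and case
items <- tolower(trimws(gsub("\\s+", " ", items)))

# Hypothetical plural-to-singular lookup (a blanket "strip the s" rule would
# mangle words such as "swiss")
singular <- c(eggs = "egg", potatoes = "potato")
hit <- items %in% names(singular)
items[hit] <- unname(singular[items[hit]])

# One row per ingredient per recipe
food <- data.frame(recipe_id = 1, food = items)
print(food)
```

A rule-based approach like this will miss cases, such as ingredients without a leading quantity, which fits the caveat above that not every case is guaranteed to be covered.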
In the data analysis section, we will discuss the top ingredients.
To get a sense of the data we are working with, I produced some basic graphs and a table with descriptive statistics on most of the variables.
We can see above that we are dealing with recipes mostly from the last 5 years, so they should reflect current food trends.
The above graphs display the spread of the nutritional variables. With 50% of recipes having less than 11 grams of protein per serving, we may have more vegetarian recipes than meat-based ones. The median is about 320 calories per serving, so most recipes are likely moderate rather than overly indulgent.
The table below gives a more detailed numerical look at the raw variables.
| variable | level | statistics |
|---|---|---|
| Number of Records | NA | 2209 |
| date_published | [2005,2010) | 1 (0.05%) |
| date_published | [2010,2015) | 10 (0.45%) |
| date_published | [2015,2020) | 48 (2.17%) |
| date_published | [2020,2025] | 2150 (97.33%) |
| calories | mean(sd) | 358.16 (239.27) |
| calories | median(q1,q3) | 319.5 (190, 477) |
| fat | mean(sd) | 18.76 (16.96) |
| fat | median(q1,q3) | 15 (7, 26) |
| carbs | mean(sd) | 31.87 (25.87) |
| carbs | median(q1,q3) | 26 (13, 45) |
| protein | mean(sd) | 16.61 (16.3) |
| protein | median(q1,q3) | 11 (4, 25) |
| avg_rating | mean(sd) | 4.51 (0.4) |
| avg_rating | median(q1,q3) | 4.6 (4.3, 4.8) |
| reviews | mean(sd) | 77.06 (142.25) |
| reviews | median(q1,q3) | 21 (6, 74) |
| prep_time | mean(sd) | 21.53 (60.84) |
| prep_time | median(q1,q3) | 15 (10, 25) |
| cook_time | mean(sd) | 41.8 (63.23) |
| cook_time | median(q1,q3) | 25 (10, 45) |
| total_time | mean(sd) | 171 (642.8) |
| total_time | median(q1,q3) | 60 (35, 120) |
| servings | mean(sd) | 10.47 (13.44) |
| servings | median(q1,q3) | 8 (4, 12) |
Next, we will look at the two critical variables for this analysis: cuisine and ingredients. Below are graphs detailing the proportion of recipes by cuisine and the most common ingredients.
There are over 40 different cuisines, most of which are directly related to a single country, with a few exceptions such as Jewish, Cajun and Creole, Amish and Mennonite, and Southern Recipes. The largest shares of recipes are Brazilian and Filipino, while the smallest is Belgian. Several cuisines are missing from this dataset, which is a limitation of this analysis. For example, we do not have any data on countries in Africa besides South Africa.
Returning to the top ingredients, after standardizing and tokenizing, there were over a thousand uniquely named ingredients. I calculated how often each ingredient appears in the 2,209 recipes and graphed the proportion of recipes containing each of the top 25 ingredients.
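The proportion calculation can be sketched as below. The `tokens` table is toy data with hypothetical column names, standing in for the one-row-per-ingredient-per-recipe table. The key detail is counting each ingredient at most once per recipe before dividing by the number of recipes.

```r
# Toy token table: one row per ingredient per recipe (recipe 3 lists salt twice)
tokens <- data.frame(
  recipe_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  food      = c("onion", "salt", "egg", "onion", "butter", "onion", "salt", "salt")
)

# Count each ingredient once per recipe
tokens <- unique(tokens)

# Share of recipes containing each ingredient, largest first
n_recipes <- length(unique(tokens$recipe_id))
prop <- sort(table(tokens$food) / n_recipes, decreasing = TRUE)

# Keep the top 25 (here there are fewer than 25 ingredients, so all are kept)
top <- head(prop, 25)
print(round(top, 2))
```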
The top 25 ingredients are not very surprising and make sense intuitively. I am a little surprised that onion ranks ahead of salt, sugar, and water, though. Cinnamon and vanilla extract were also in the top 25, which suggests to me that there are several baking recipes like cookies or cinnamon rolls.
Now, to figure out how ingredient usage breaks down by cuisine, I produced a heat map and added 25 more ingredients to make it a little more interesting. I organized the cuisines on the y-axis by the UN Geoscheme for regions, and when a cuisine was not a country, I grouped it with the region with which it is typically associated. The only cuisine I did not assign to a region was Jewish cuisine, since it spans multiple areas and its communities are dispersed globally. We will use these groupings for visualization with t-SNE later. Some things that stood out to me were:
For fun, I also created histograms of the top 5 ingredients in each country after removing the top 10 ingredients overall, like salt and sugar. The top 5 ingredients made intuitive sense to me, but I could also see that the recipes included in this dataset are not representative of an entire cuisine; they skew toward the more popular recipes.
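The per-cuisine breakdown behind the heat map and the per-country charts can be sketched as a cuisine-by-ingredient table of shares. The data and column names here are toy assumptions, not the actual dataset.

```r
# Toy token table with a cuisine label per recipe (hypothetical values)
tokens <- data.frame(
  country   = c("Italian", "Italian", "Italian", "Italian",
                "Thai", "Thai", "Thai", "Thai", "Thai"),
  recipe_id = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
  food      = c("garlic", "basil", "garlic", "tomato",
                "garlic", "lime", "fish sauce", "lime", "coconut milk")
)
tokens <- unique(tokens)

# Number of distinct recipes in each cuisine
recipes_per_cuisine <- tapply(tokens$recipe_id, tokens$country,
                              function(x) length(unique(x)))

# Recipes per cuisine that use each ingredient
counts <- table(tokens$country, tokens$food)

# Divide each row by its cuisine's recipe count; the vector recycles down the
# columns, so row i is divided by element i
share <- counts / as.vector(recipes_per_cuisine[rownames(counts)])
print(round(share, 2))
```

Using shares rather than raw counts keeps cuisines with many recipes from dominating the color scale of a heat map.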
The steps below take the recipes’ ingredients and check whether any structure or clusters emerge through t-SNE.

### Data Set Up

- Created a binary matrix where each row represents a recipe and each column represents an ingredient
- Marked presence of an ingredient with 1 and absence with 0
- Ensured each recipe-ingredient pair is unique
- Converted the table into a matrix and scaled the matrix for analysis

X, Y) for each recipe in the t-SNE plot

To see whether recipes with similar ingredient profiles sit together, we used PCA to reduce dimensionality and applied t-SNE to visualize local neighborhoods. Asian recipes, particularly Southern, Eastern, and South-eastern Asian ones, are located near each other. There are also a couple of small North American and South American clusters, while European recipes are spread out. One thing I did notice is that sweeter recipes were located near one another, as were cocktail and drink recipes. I think the t-SNE might be showing us flavors or types of meals, like desserts. If that is the case, it would explain why we see those Asian recipes together: when I went through the names, there was more homogeneity in the type of recipe, and most of them were savory dinner recipes. The other cuisines, particularly those in Europe, seem to have more of a balance of savory and sweet dishes, with fewer regionally specific ingredients.
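A compact sketch of this pipeline is shown below. It is not the code behind the plot: the token table is synthetic, the number of principal components kept and the perplexity are arbitrary choices for a tiny example, and the t-SNE step assumes the Rtsne package, so it is skipped if that package is not installed.

```r
# Synthetic tokens: 8 recipes, each using 3 of 8 ingredients (toy data)
foods <- c("onion", "garlic", "soy sauce", "ginger",
           "butter", "sugar", "flour", "vanilla extract")
tokens <- do.call(rbind, lapply(1:8, function(i) {
  data.frame(recipe_id = i, food = foods[(i + 0:2 - 1) %% 8 + 1])
}))

# Ensure each recipe-ingredient pair is unique, then build the 0/1 matrix
# (rows are recipes, columns are ingredients)
tokens <- unique(tokens)
bin <- unclass(table(tokens$recipe_id, tokens$food))

# Drop constant columns, which cannot be scaled, then center and scale
bin    <- bin[, apply(bin, 2, sd) > 0, drop = FALSE]
scaled <- scale(bin)

# PCA to reduce dimensionality; keep the first few components
pca    <- prcomp(scaled, center = FALSE)
k      <- min(5, ncol(pca$x))
scores <- pca$x[, seq_len(k), drop = FALSE]

# t-SNE on the PCA scores, only if Rtsne is available (assumed package)
if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(42)
  tsne <- Rtsne::Rtsne(scores, dims = 2, perplexity = 2,
                       pca = FALSE, check_duplicates = FALSE)
  coords <- data.frame(recipe_id = rownames(bin),
                       X = tsne$Y[, 1], Y = tsne$Y[, 2])
  print(coords)
}
```

Because t-SNE mainly preserves local neighborhoods, distances between far-apart groups in the resulting plot should be read with caution.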
There are some limitations that I want to acknowledge. The cuisines dataset is a relatively small sample that does not capture global cuisine. Additionally, since allrecipes.com is a U.S.-owned company, we are going to have more recipes aimed at North American and English-speaking audiences. For example, recipes labeled “Chinese” may be Americanized versions that don’t accurately represent traditional cuisine. Lastly, food often crosses borders, and tagging a recipe with a single cuisine ignores the impact of colonization, immigration, fusion, and globalization.
If I were to continue this project, I’d like to do the following: